arXiv:1505.01665vl [math.ST] 7 May 2015 


Bayesian Analysis (2015) 


10 , Number 2, pp. 275-296 


Dirichlet Process Hidden Markov Multiple 
Change-point Model 


Stanley I. M. Ko*, Terence T. L. Chong'f, and Pulak Ghosh* 


Abstract. This paper proposes a new Bayesian multiple change-point model 
which is based on the hidden Markov approach. The Dirichlet process hidden 
Markov model does not require the specification of the number of change-points 
a priori. Hence our model is robust to model specification in contrast to the fully 
parametric Bayesian model. We propose a general Markov chain Monte Carlo al¬ 
gorithm which only needs to sample the states around change-points. Simulations 
for a normal mean-shift model with known and unknown variance demonstrate 
advantages of our approach. Two applications, namely the coal-mining disaster 
data and the real United States Gross Domestic Product growth, are provided. 
We detect a single change-point for both the disaster data and US GDP growth. 
All the change-point locations and posterior inferences of the two applications are 
in line with existing methods. 

Keywords: Change-point, Dirichlet process, Hidden Markov model, Markov 
chain Monte Carlo, Nonparametric Bayesian. 


1 Introduction 

The earliest Bayesian change-point model is explored by Chernoff and Zacks (1964), who 
assume a constant probability of change at each point in time. Smith (1975) investigates 
the single change-point model under different assumptions of model parameters. Carlin 
et al. (1992) assume that the structural parameters are independent of the change-points 
and introduce the Markov chain Monte Carlo sampling method to derive the posterior 
distributions. Stephens (1994) further applies the Markov chain Monte Carlo method to 
the case of multiple changes. Chib (1998) allows the change-point probability to depend 
on the regime between two adjacent change-points. Koop and Potter (2007) propose 
the Poisson hierarchical prior for durations in the change-point model that allows the 
number of change-points to be unknown. More recent works on the Bayesian change- 
point model include Wang and Zivot (2000), Giordani and Kohn (2008), Pesaran et al. 
(2006), Maheu and Gordon (2008) and Geweke and Yu (2011). 

In this paper we follow the modeling strategy of Chib (1998) which is one of the 
most popular Bayesian change-point models. He introduces a discrete random variable 
indicating the regime from which a particular observation is drawn. Specifically, let 


* Department of Finance and Business Economics, University of Macau, Macau, 
StanleyKo@umac.mo 

^Department of Economics and Institute of Global Economics and Finance, The Chinese University 
of Hong Kong, Hong Kong, and Department of International Economics and Trade, Nanjing University, 
China, chong2064@cuhk.edu.hk 

■^Department of Quantitative Methods & Information Systems, Indian Institute of Management at 
Bangalore, India, pulak.ghosh@iimb.ernet.in 


©2015 International Society for Bayesian Analysis 


DOI: 10.1214/14-BA910 



276 Dirichlet Process Hidden Markov Multiple Change-point Model 


Y n = ( 2/1 , 3 / 2 , - - -, Vn)' be the observed time series, such that the density of yt conditioned 
on Y t - 1 = ( 2 / 1 , 2 / 2 , ■ • • iVt-i)' depends on the parameter 6 whose value changes at an 
unknown time period 1 < t\ < ■ ■ ■ < Tk < n and remains constant within each regime, 
that is, 


p(yt 

Y t -i,0i) 

if t < Ti, 

p(yt 

\Y t -i,e 2 ) 

if n < t < r 2 , 

p{vt 

l*i-iA) 

if Tfc_i <t <Tk 

p(yt 

Y t -i,6k+i) 

if Tk <t <n, 


where Oi £ M 1 is an l dimension vector, i = 1, 2, ...,& + 1. Note that we consider in 
this paper the change-point problem when the data are assumed to be generated by a 
parametric model where the unknown parameter 9i changes with respect to different 
regimes. Let St be the discrete indicator variable such that 


Vt | St ~p{yt | Y t -i,9 St ), 


( 2 ) 


where St takes values in {1,2,..., k, k + 1}. The indicator variable s t is modeled as a 
discrete time, discrete-state Markov process with the constrained transition probability 
matrix 

fpn P 12 0 0 \ 

0 P 22 P 23 • • • 0 


p = 


( 3 ) 


Pkk Pk{k+1) 

\ 0 0 ••• 0 1 ) 


where p v j = pr(s* = j \ St- 1 = i) is the probability of moving to regime j at time t 
given that the regime at time t — 1 is i. With this parameterization, the ith change-point 
occurs at n if s Ti = i and s Ti _|_i = i + 1. 


As pointed out in Chib (1998), the above is a hidden Markov model where the 
transition matrix of the hidden state St is restricted as in (3). Hence, Chib’s multiple 
change-point model inherits the limitation of the hidden Markov model in that the 
number of states has to be specified in advance. In light of this, Chib (1998) suggests 
to select from alternative models (e.g. one change-point vs. multiple change-points) ac¬ 
cording to the Bayes factors. In this paper, we introduce the Dirichlet process hidden 
Markov model (DPHMM) with left-to-right transition dynamic, without imposing re¬ 
strictions on the number of hidden states. The use of the DPHMM has the following 
appealing features: 


1. We do not have to specify the number of states a priori. The information provided 
by the observations determines the states endogenously. Hence, our method can 
be regarded as semiparametric. 

2. Our modeling approach facilitates the sampling of states since we only need to 
sample the states around change-points. 
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We note that Kozumi and Hasegawa (2000) propose a method similar to ours in that 
they utilize a Dirichlet process prior for 9 , but in a mixture model. 

The rest of the paper is organized as follows. Section 2 provides a brief introduction 
of the Dirichlet process. Section 3 incorporates the Dirichlet process into the change- 
point model. The general Markov chain Monte Carlo sampler is discussed in Section 4. 
A Monte Carlo study of the normal mean-shift model is conducted in Section 5. Section 6 
discusses learning of DPHMM parameters. Section 7 provides applications of our model 
and Section 8 concludes the paper. 


2 The Dirichlet Process 

Our new method employs the Dirichlet process technique which is widely used in non- 
parametric Bayesian models. The Dirichlet process prior is first proposed by Ferguson 
(1973). He derives the Dirichlet process prior as the prior on the unknown probability 
measure space with respect to some measurable space (Cl, J-). Hence the Dirichlet pro¬ 
cess is a distribution over probability measures. Blackwell and MacQueen (1973) show 
that the Dirichlet process can be represented by the Polya urn model. Sethuraman 
(1994) develops the constructive sticking-breaking definition. 

In the present study, we assume a Dirichlet process prior to each row of the transition 
matrix. The Dirichlet process is best defined here as the infinite limit of finite mixture 
models (Neal (1992), Neal (2000) and Beal et al. (2002)). To illustrate the idea, let us 
first consider the case with a finite number of states. With the left-to-right restriction to 
the transition dynamic, a particular state St_i = i will either stay at the current state 
i or transit to a state j > i. A left-to-right Markov chain with k states will typically 


have the following upper triangular 

transition matrix 



(pu 

Pl2 

Pis ■■■ 

Plk\ 


0 

P22 

P 23 ■■■ 

P2k 

P = 

0 

0 

P33 ■ ‘ ■ 

P3k 


\o 

0 


Pkk J 


where the summation of each row equals 1. Note that the left-to-right Markov chain 
here is different from Chib’s restricted band transition matrix (3). Here, the number of 
states k is not necessarily the number of regimes as is the case in Chib’s model. 

Let Pi = (0, ... ,Pu,Pi(i+ 1 ), • • • ,Pik) be the transition probabilities of the ith row of 
the transition matrix (4). Suppose we draw m samples {ci,..., c m } of St+i given St = i 
with probability profile p, . The joint distribution of the sample is thus 

k 

pr(ci,...,c m | p0 = 11^’ ( 5 ) 

j=i 

where nrij denotes the number of samples that take state j, j = i ,..., k. We assume a 
symmetric Dirichlet prior 7r(pj | /3) for p, with positive concentration parameter /3: 
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p, | 0 ~ Dirichlet (0/(k — i + 1),..., 0/(k — i + 1)) = 




(— 8 —) 

i+i J 


n 


(3/(k-i+ 1)-1 


fe-i+1 

j=i 


With the Dirichlet prior, we can analytically integrate out p* such that 
pr(ci,..., c m | 0) = J pr(ci,..., c m \ p il 0)dir{p l \ 0) 


r(/3) 


r(m + 0 ) 


n 


— ^ ( m i + k-i+l ) 

7 » \ 

i+1 J 


( 6 ) 


(7) 


The conditional probability of a sample Cd € {ci,..., c m } given all other samples is thus 

m-dj + 0/{k + i-l) 


pr(cd = j | c_ d ) = 


m — 1 + , 


( 8 ) 


where c_d denotes the sample set with Cd deleted, and m-dj is the number of samples 
in C-d that take state j. 

Taking the limit of equation (8) as k tends to infinity, we have 


m-i+p j e {M+l,...,/c}, ^ 

for all potential states . 

Note that the probability that Cd takes an existing state, say j, is proportional to m-dj, 
which implies that Cd is more likely to choose an already popular state. In addition, the 
probability that a new state (i.e. k + 1) takes place is proportional to 0. Hence, there 
are potentially many states available, with infinite dimension transition matrix 


pr(cd = j | c- d ) = 


(pn 

Pl2 

Pis 

■\ 

0 

P22 

P23 


0 

0 

Pss 


V ; 





( 10 ) 


The actual state space can be regarded as consisting of an infinite number of states, only 
a finite number of which are actually associated with the data. Therefore, the number 
of states is endogenously determined. 


3 The Dirichlet Process Hidden Markov Multiple 
Change-point Model and the State Evolution 

Let us now turn to the proposed multiple change-point model and discuss a particular 
state evolution. Suppose we have already generated the hidden states up to St = i- We 
impose the Dirichlet process as described in Section 2 to st+i. In the change-point model, 
the transitions that have existed so far from state i are only self transitions. With the 
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left-to-right restriction, we will neither see a backward transition, i.e., transition from 
state i to some previous states, nor a forward transition, i.e., transition to some future 
states. The counts of the existing transitions from state i will be used as the counts 
defined in equation (9). Hence, we will have 


P r ( s t+i = j | s t = i, si,..., s t - 1 ) 


nq 

nu+p 

0 

nn+P 


j = h 

s t+ i takes a new state, 


( 11 ) 


where nu = Yj\>= i S(s t >,i)5(st'+i,i) denotes the counts of transitions that have occurred 
so far from state i to itself. 1 Note that in equation (11), St+i depends only on the state 
that St takes according to the Markovian property. All other previous states merely 
provide the transition counts. 


We introduce a self-transition prior mass a for each state. The idea here is that if 
st transits to a new state, say St+i = * + 1, then without previous transition records, 
the next state s t +2 conditioned on s t+1 = i + 1 will further take another new state with 
probability 1. Hence, with a , the trivial case is avoided and we have 

{ nu+a j = i 

nu+P+a ’ (12) 

nu+p+a St + 1 tak<3S a neW State - 


Therefore, the whole Markov chain is characterized by two parameters, a and /3, 
instead of a transition probability matrix. We can see that a controls the prior tendency 
to linger in a state, and /3 controls the tendency to explore new states. Figure 1 illustrates 
three left-to-right Markov chains of length n = 150 with different a and /?. Figure la 
depicts the chain that explores many new states with very short linger time. Figure lb 
shows the chain with long linger time and less states. Figure lc lies in between. 


Equation (12) coincides with Chib (1998)’s model when the probability pu is inte¬ 
grated out. Specifically, in Chib (1998), 


pr(s t+ i = j | s t = i, si,..., s t _i) 


= / p{st +1 


nu+a 

nq+a+b 

b 

. nq+a+b 


j\s t = i,si,..., st-i,pu)f(pii)dp li 

j = i, 
j = i + 1 


(13) 


where pa ~ Beta(a, b) and f(pu) is the corresponding density. However, our model stems 
from a different perspective. The derivations in the previous section and equation (12) 
follow the nonparametric Bayesian literature (Neal (2000)) and the infinite HMM of Beal 
et al. (2002). Indeed, it is known that when the Dirichlet process is truncated at a finite 
number of states, the process reduces to the generalized Dirichlet distribution (GDD), 
see Connor and Mosimann (1969) and Wong (1998). For the same reason we have (12) 
coincides with (13). We would like to point out that our modeling strategy facilitates 
the Gibbs sampler of s t which is different from Chib (1998). We will elaborate further 
on the Gibbs sampler in the next section. Also, learning of a and /3 will be discussed in 
Section 6. 


1 The Kronecker-delta function <5(a, b) = 1 if and only if a = b and 0 otherwise. 









280 


Dirichlet Process Hidden Markov Multiple Change-point Model 





(a) a = 1, /3 = 5. 


(b) a = 5, P = 1. 


(c) a = 5, /3 = 5. 


Figure 1: Left-to-right Markov chain with different a and /3. 


4 Markov Chain Monte Carlo Algorithm 

4.1 General 

Suppose we have observations Y n = (j/i, J/ 2 , ■ ■ •> Un)' ■ Given the state S n = (si,..., s n )' 
we have 

yt I s t ~p(y t | (14) 

where 0 St £ R z , y t _i = (j/i,..., yt-i)'- Let 0 = (0i,..., Ok)' and 7 denotes a hyperpa¬ 
rameter. Recall that we impose the DPHMM to the states and a hierarchical model to 
the parameter, we are thus interested in sampling from the posterior p(0,S n ,^ \ Y n ) 
given the priors p(0 17 ), p(j) and p(S n ). The general Gibbs sampler procedure is to 
sample the following in turn: 

Step 1. S n | 0,7,y n , 

Step 2. 0 | 7 , S n ,Y n , 

Step 3. 7 | 0,S n ,Y n . 

We will discuss the three steps below. 

4.2 Simulation of S n 

The state prior p(S n ) can be easily derived from (12). Moreover, the full conditional is 

p(S n | 0,7,y„) oc p{S n )p(Y n | S n ,0, 7 ). (15) 

Simulation of S n from the full conditional (15) is done by the Gibbs sampler. Specifically, 
we draw St in turn for t = 1 , 2 , ..., n from 

p{s t | S' t _i,S' t+ 1 , 0 , 7 ,y ri ) oc p(s t | st-i,S t - 2 )p(s t+ i | s t ,S t+2 )p(y t \ s t ,Y t _ 1 , 0 , 7 ), 

(16) 
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where St -1 = (si,..., St-i)' and S t+1 = (st+i, ■ ■ ■, s n )'. The most recent updated 
values of the conditioning variables are used in each iteration. Note that we write 
p(st | St-1, St- 2 ) and p(st +1 | st,S t+2 ) to emphasize the Markov dynamic; the other 
conditioning states merely provide the counts. 

With the left-to-right characteristic of the chain, we do not have to sample all St 
from t = 1 to T. Instead, we only need to sample the state in which a change-point takes 
place. To see this, let us consider a concrete example. Suppose from the last sampler, 
we have 


Si 

S2 

S 3 

S 4 

S5 

S6 

S 7 

S 8 

s 9 

S10 

Sll 

1 

1 

1 

2 

2 

3 

3 

3 

4 

5 

5 ••• 


With left-to-right transition restriction, s t requires a sampling from (16) if and only if 
St_ 1 and s t+ i differ. For other cases, s t is unchanged with probability one. Suppose we 
are at t = 2. Since si and S 3 are both equal to 1, S 2 is forced to take 1. In the above 
chain, the first state that needs to sample from (16) is S 3 , which will either take the 
values 1 or 2 (s 2 or S 4 ). If S 3 takes 2 in the sampling (i.e., joining the following regime), 
then the next state to sample would be S 5 ; otherwise (i.e., joining the preceding regime), 
the next state to sample is S 4 because S 5 — S 3 ^ 0 for S 3 = 1. Now suppose we are at 
t = 9. We will draw a new Sg and s 9 will either join the regime of ss or the regime 
of siq . This will look strange because a gap exists in the chain. However, our concern 
here is the consecutive grouping or clustering in the series. We can alternatively think 
that the state represented by sg is simply pruned away in the current sweep. Note the 
numbers assigned to the st s are nothing but indicators of regimes. Therefore, we will 
relabel the s*’s after each sweep. 

In general, suppose St_i = i and st+i = i + 1. St takes either i or i + 1. Table 1 shows 
the corresponding probability values specified in (16). To see this, if st takes i, then the 
transition from s t -i to s t is a self-transition and that from s t to s t +i is an innovation. 
The corresponding probability values are in the first row of Table 1. The reasoning for 
St = i + 1 is similar. Note the changes of counts in different situations. 


Table 1: Sampling probabilities of St- nu = X)t'=i ^( s f> *)^( s t'+ij *) and ni_|_i,i+i = 
T.t'Zl+iHst'P + l)d(s t '+i,i + 1 ). 



p{st | st 1 = i, S t - 2 ) 

p(st+i = * + 1 | s t , S t+2 ) 

p{yt | St, Ft-1,0) 

st = i 

nu + a 

P 

p(yt 1 Y t -i, 0i) 

nu+ (3 + a 

Tlti + 1 + j3 + ex 

s t = i+1 

P 

^i+lji+l + OL 

p{y t 1 Y t -i,6i+i) 

nu+ P + a 

^i+l,i+l + P + OL 


For the initial point si, if currently S 2 — si ^ 0, then we can sample si from 


pr(si | s 2 ,s 3 ,... ,s„) 


P+a 

P 

P+a 


P+a 


■p(yi 

2 +« 


n S 2 S 2 +P+a 


Y 0 ,o si ) 
p{yi I y 0 ,6 s 


if S! unchanged, 
if si = s 2 , 


(17) 


2 This will exclude the case of the regime with only one data point. We do not consider this situation 
here. 
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where n S2S2 = ^™,_ 2 S 2 )< 5 (st'+i, S 2 ) and c is the normalizing constant. For the end 

point s n , if s n — s n _i ^ 1 , we sample s n from 




/ I \ _ J +/3+a 


™ 3 n-l 3 rt-l +/3+ a 


where n Sn _ lSn _ 1 = i5(s t /, s„_i)5(s t / + i, s,, 


* P(]Jn | ^n—1 i^s n _i) If — ^n—1; 

■p(y n | Y n _i,0 s J if s„ unchanged, 

(18) 

i) and c is the normalizing constant. 


As mentioned in Section 3, the DPHMM facilitates the Gibbs sampler of states. 
In sampling St from (16), we simultaneously use up all information of the transitions 
prior to t (i.e. si,..., St-i) and after t (i.e. St+i> • • •, st ) which are captured in p(st \ 
St—i , St— 2 ^) and p(st +1 | St,S t+2 ). Thus far, our algorithm only requires a record of 
transitions and draws at the point where the structural change takes place, whereas we 
have to sample all St in Chib (1998). 


4.3 Updating 6 and 7 

Given S n and Y n , the full conditionals of 0 and 7 are simply 

p(8i I l,S n ,Y n )cLp{0i | 7 ) p(y t \Yt-!,9i), 

{t:s t =i} (19) 

p{ 7 I 8, S n , Y n ) oc p(j) p(0 I 7 , S n , Y n ), 

which are model specific. In the following sections, we will study a simulated normal 
mean-shift model, a discrete type Poisson model, and an Ar(2) model. 


4.4 Initialization of States 


In Section 4.2, we have discussed the simulation of the states S n . The number of change- 
points is inherently estimated through the sampling of states in equation (16). Within 
the burn-in period, the state number will be changing around after each MCMC pass. 
After the burn-in period, the Markov chain converges and hence the number of states 
becomes stable. Theoretically, it is legitimate to set any number of change-points in the 
beginning and let the algorithm find out the convergent number of states. In practice, 
we find it is more efficient to initialize with a large number of states and let the algo¬ 
rithm prune away redundant states, rather than allow for the change-point number to 
grow from a small number. Specifically, suppose a reasonably large state number k is 
proposed. We initialize equidistant states, that is 


St = i, 


(i — 1 ) • n i ■ n 


( 20 ) 


where i = 1,..., k. Then the algorithm described above will work out the change-point 
locations and the number of states after convergence of the Markov chain. 
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5 A Monte Carlo Study: the Normal Mean-Shift Model 

5.1 The Model 

In this section, we first study the normal mean-shift model with known variance o 1 . 
Suppose the normal data Y n = (y i,..., y n )' is subject to unknown k changes in mean. 
We use the following hierarchical model 

yt | Oi ~ N(0j, er 2 ) if n-i <t <n, 

9i | y,v 2 ~ N(/r,u 2 ), (21) 

{g,v 2 ) ~ Inv-Gamma(z; 2 | a, b), 

where o 2 is known and r, (* = 1,..., k) is the change-point. We set r 0 = 0 and t ^+i = n. 
Next, we apply our algorithm to the case of unknown variance. In addition to (21), we 
assume the Inverse-Gamma prior for <r 2 

o 2 ~ Inv-Gamma(cr 2 | c,d). (22) 

All derivations of the full conditionals and the Gibbs samplers are given in the Appendix. 

We simulate two normal sequences with the parameters specified in Table 2. Specif¬ 
ically, Model 1 is subject to one change-point occurring at t = 50. Model 2 is a two 
change-points model with breaks at t = 50 and t = 100. Both models assume variance 
ct 2 = 3. Two realizations with respect to Model 1 and Model 2 are shown in Figure 2. 
We can see the overlapping of the data ranges of different regimes and it is hard to 
visually identify the change-points. 




(a) Model 1. (b) Model 2. 


Figure 2: Random realizations of Model 1 and Model 2. 
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Table 2: Normal Mean-Shift Models 1 and 2. 



0i 

02 

03 


Tl 

t 2 

k 

n 

Model 1 

(One change-point) 

1 

3 

- 

3 

50 

- 

1 

150 

Model 2 

(Two change-points) 

1 

3 

5 

3 

50 

100 

2 

150 


5.2 Simulation Results 

To implement our algorithm, we set the inverse-Gamma hyperparameters a = b = c = 
d = 1, and the DPHMM parameters a = 3 and /3 = 2. The two Gibbs samplers for 
the cases of known and unknown variance are conducted for 5000 sweeps with 5000 
burn-in samples, respectively. The 5000 sweeps after the burn-in period are thinned 
with 50 draws to reduce dependence of iterations. The first column of Figure 3 shows 
the probabilities of regime indicator St = i of the two models. Intersections of the lines 
St = i clearly demonstrate the break locations. 

To compare our proposed DPHMM to Chib’s method, we also report the posterior 
inference of Chib’s model under the true change-point number and the same model 
specification as in (A-3) and (A-4). 3 The posterior means and standard deviations of 
parameters are summarized in Table 3. First, our method performs well in all cases 
where the posterior distributions concentrate on the true values. The sample first-order 
serial correlations demonstrate good mixing of the samplers. Second, our results are 
comparable to those estimated from Chib’s model. The Bayes factors show that in most 
of the cases the models with the true number of change-points are preferred to others. 
For example, in Model 2 where two change-points exist, the Bayes factors comparing 
models with k = 1 versus k = 2 are close to zero favoring the two change-point model. 
Likewise, the Bayes factors comparing k = 2 versus k = 3 favor the two change-point 
model. Hence we conclude that the model with two change-points is correctly specified 
with high probability. However, the Bayes factor fails to detect the correct number of 
changes in Model 1 with unknown variance. The values suggest a model with two change 
points. The posterior probabilities of states estimated by Chib’s model are shown in the 
second column of Figure 3. In summary, the simulation results demonstrate that our 
algorithm works well in the normal mean-shift models and is robust to the change-point 
number compared to Chib’s model. 


5.3 Robustness Check of Change-Point Number 

In this section, we study the robustness of our algorithm in detecting the true number 
of change-points. Although our method does not require prespecification of the change- 
point number, it is still possible that our algorithm fails to estimate the correct number 
of change-points. Thus, we replicate the entire estimation process as in the previous 


3 The prior of the transition probabilities in Chib (1998)’s model is assumed to be Beta(a, b). The 
parameters a and b are chosen to reflect equidistant duration of each state. For example, in the case of 
one change-point with n = 150 sample size, we take b = 0.1 and a = n/2 X b = 7.5, i.e. Beta(7.5, 0.1). 
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DPHMM: Model 1 with known variance. Chib: Model 1 with known variance. 




DPHMM: Model 1 with unknown variance. Chib: Model 1 with unknown variance. 




DPHMM: Model 2 with known variance. 


Chib: Model 2 with known variance. 




DPHMM: Model 2 with unknown variance. 


Chib: Model 2 with unknown variance. 


Figure 3: Posterior probability of st = i- DPHMM vs. Chib’s model. 
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Table 3: Posterior estimates of Normal Mean-Shift Models 1 and 2. Mean and SD denote, 
respectively, posterior mean and posterior standard deviation. 






Mean 

Model 1 

Known variance Unknown variance 

SD True value Mean SD True value 







DPHMM 





0 i 


0.9123 

0.2519 

1.0000 

0.9154 

0.2432 

1.0000 



o 2 


2.9375 

0.1812 

3.0000 

2.9354 

0.1723 

3.0000 



a 2 





2.8244 

0.3343 

3.0000 







Chib’s Model with k = 1 





0 i 


0.8980 

0.1889 

1.0000 

0.9521 

0.2441 

1.0000 



0 2 


2.9520 

0.1324 

3.0000 

2.9331 

0.1619 

3.0000 



a 2 





2.7933 

0.4121 

3.0000 







Bayes Factor Analysis 



k = 

1 

vs. k = 

2 

1.730 


0.548 


k = 

1 

vs. k = 

3 

1.374 


1.010 


k = 

2 

vs. k = 

3 

0.794 


1.850 








Model 2 








Known variance 

Unknown variance 





Mean 

SD 

True value 

Mean 

SD 

True value 







DPHMM 





0 i 


1.1746 

0.2777 

1.0000 

1.2770 

0.2584 

1.0000 



e 2 


2.9758 

0.2874 

3.0000 

3.2176 

0.2672 

3.0000 



o 3 


5.3344 

0.2459 

5.0000 

5.1108 

0.2682 

5.0000 



a 2 





3.1102 

0.3726 

3.0000 







Chib’s Model with k = 2 





0 i 


1.1770 

0.2052 

1.0000 

1.3680 

0.2643 

1.0000 



0 2 


2.9480 

0.1916 

3.0000 

3.1920 

0.2730 

3.0000 



63 


5.3320 

0.1823 

5.0000 

5.0310 

0.2461 

5.0000 



a 2 





3.0780 

0.6794 

3.0000 







Bayes Factor Analysis 



k = 

1 

vs. k = 

2 

0.000 


0.0193 


k = 

1 

vs. k = 

3 

0.000 


0.0583 


k = 

2 

vs. k = 

3 

3.090 


3.0332 



section for 1000 times and record the estimated change-point number in each replication. 
Specifically, in each of the 1000 replications, we iterate the Gibbs sampler for 5000 times 
and the change-point number of the last sample is recorded. Therefore, we obtain 1000 
collections of change-point numbers. Table 4 reports the frequencies of the detected 
change-point numbers. We see that with high frequency (over 99%) our method detects 
one change-point (k = 1) in Model 1 in both cases of known and unknown variance. 
In Model 2, our results show that over 90% of the 1000 replications detect two change- 
points (k = 2), and over 99% detect at least one change-point. The figures demonstrate 
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that our algorithm correctly detects the change-point number with high probability in 
different cases. 


Table 4: Frequencies of estimated change-point numbers. 



Known variance 

Unknown variance 


k = 0 

k = 1 

k = 2 

k = 0 k = 1 k = 2 

Model 1 

(One change-point) 

Model 2 

(Two change-points) 

0.3% 

99.7% 

0 .0% 

0.5% 99.5% 0.0% 

0 .0% 

6.5% 

93.5% 

0.2% 8.7% 91.1% 


6 Learning cr and j3 

In the previous section, we set the DPHMM parameters a = 3 and /3 = 2 in estimating 
the simulated models. In order to learn about a and P, we propose to use vague Gamma 
priors, see Beal et al. (2002). Note that with the number of states specified in each 
MCMC sweep, the DPHMM reduces to the generalized Dirichlet distribution (GDD), 
see Connor and Mosimann (1969) and Wong (1998). Hence the posterior is 

fc+l 

p{a,/3 | S n ) oc Gamma(a Q ,& Q )Gamma(a / ?,^) ^ r( T m ^ m r\ ’ ( 23 ) 

f=i r ( a ) r(n li + 1 + a + p) 

where nu = ^2j = i 5(st,i)6(st+i,i) denotes the counts of self transitions. We set a a = 
b a = ap = bp = 1 here and in the subsequent sections. Below we consider two alternative 
approaches for sampling: the first based on maximum-a-posteriori (MAP) estimation 
and a second approach using a random walk sampler. 


6.1 The Maximum-a-Posteriori 


We first solve for the maximum-a-posteriori (MAP) estimates for a and /3 which are 
obtained as the solutions to the following gradients using the Newton-Raphson method, 


91np(a,/3 | S n ) a a — 1 


da 


-b n 


k+1 


+ ^2 [^( a + P) + 4>{nu + a) - - ip(nu + 1 + a + /?)] = 0 


*=l 


(24) 


dlnp(a,/3 | S n ) ap — 1 


k+1 

E 


- + ip (a + P) - ip(nu + 1 + a + p) 


= 0 , 


where ip{-) is the digamma function defined as ip(x) = d\nT(x)/dx. 

We implement our algorithm in the previous section together with the MAP update 
of a and P in each sweep. The DPHMM with MAP update correctly detects the true 
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Table 5: MAP of a and /3 in Normal Mean-Shift Models 1 and 2. Average MAP values of 
a and /3 are reported with standard deviations within parentheses. For other parameters, 
the values are posterior means and posterior standard deviations. 



known 

Model 1 

variance unknown variance 

known 

Model 2 

variance unknown variance 

a 

0.6353 

(0.0007) 

0.6353 

(0.0007) 

0.9453 

(0.0008) 

0.9451 

(0.0005) 

p 

0.1937 

(0.0005) 

0.1937 

(0.0005) 

0.2364 

(0.0001) 

0.2361 

(0.0005) 

6i 

0.9145 

(0.2560) 

0.9049 

(0.2430) 

1.1843 

(0.2847) 

1.2675 

(0.2585) 

@2 

2.9326 

(0.1745) 

2.9385 

(0.1731) 

2.9859 

(0.2892) 

3.2222 

(0.2716) 

o 3 





5.3349 

(0.2468) 

5.1026 

(0.2661) 

a 2 



2.8207 

(0.3340) 



3.0908 

(0.3752) 


number of change-points in all cases. Table 5 shows the MAP solutions for a and /3, and 
the posterior estimates of all parameters in each model. We can see that the average 
MAP values of a and /3 are 0.6353 and 0.1937 respectively in Model 1 with known and 
unknown variance. The results are slightly different in Model 2 such that average MAP 
values are 0.9451 and 0.2360 respectively. We also report the sample standard errors 
which show evidence of stability of the MAP values after the burn-in period. In all 
cases, a is greater than [3 indicating that the algorithm tends to linger in existing states 
rather than exploring a new one. Besides, all parameter estimates are in line with the 
results in the previous section when a and (3 are prespecified. 


6.2 The Metropolis-Hastings Sampler 

We also consider a Metropolis-Hastings (M-H) sampler for the posterior (23). The 
candidate-generating density is assumed to be the random walk process with positive 
support 

f(a'\a) oc (j){a! — a), a' > 0, (25) 

where a is the value of the previous draw, </>(•) is the standard normal density function. 
The acceptance ratio given f3 is thus 


A(a , a') 


p(a',/3 | S n ) 5>(a) 
p(a,P | S n ) $(«')’ 


(26) 


where $(•) is the standard normal distribution function. The same M-H sampler is also 
applied to (3 given the updated a. We incorporate the M-H sampler of a and ft in the 
Gibbs sampler in Section 5. The posterior estimators are shown in Table 6. The posterior 
means and standard deviations of parameter 0i and <j 2 are similar to those obtained in 
the previous analyses. The posterior mean of a is greater than the posterior mean of f3 
in all models. This affirms the conclusion in the MAP results that the algorithm tends 
to linger in existing states rather than exploring a new one. Both the MAP and M-H 
methods correctly estimate the number of change-points in each simulation. 
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Table 6: M-H sampler of a and /? in Normal Mean-Shift Models 1 and 2. Posterior 
means and posterior standard deviations within parentheses. 

Model 1 Model 2 


known variance unknown variance known variance unknown variance 


a 

1.8073 

(1.3345) 

1.8145 

(1.3646) 

2.2007 

(1.5179) 

2.1482 

(1.4731) 

p 

0.3446 

(0.2168) 

0.3455 

(0.2176) 

0.3667 

(0.2087) 

0.3622 

(0.2086) 

Oi 

1.1182 

(0.2521) 

1.0839 

(0.2399) 

1.1256 

(0.2558) 

1.0892 

(0.2457) 

@2 

2.9447 

(0.1724) 

3.0939 

(0.1701) 

2.8277 

(0.2644) 

3.0341 

(0.2449) 

#3 





5.0138 

(0.2461) 

5.1172 

(0.2454) 

(T 2 



2.8190 

(0.3298) 



2.8793 

(0.3455) 


6.3 Comparison between MAP and M-H 

We can see that the estimates of a and (3 from the two approaches are quite different as 
shown in Tables 5 and 6. The MAP as a point estimator may not reflect the variations of 
a and /?, whereas the M-H is a typical Bayesian method which can be incorporated into 
the MCMC sampler of other parameters in question. Moreover, the MAP approach may 
be limited when the posterior happens to be multi-modal. Therefore, the M-H method 
is preferred in practice and the following empirical studies are conducted with the M-H 
sampler. 


7 Empirical Applications 

7.1 Poisson Data with Change-Point 

We first apply our Dirichlet process multiple change-point model to the much analyzed 
data set on the number of coal-mining disasters by year in Britain over the period 
1951-1962 (Jarrett (1979), Carlin et al. (1992) and Chib (1998)). 

Let the disaster count y be modeled by a Poisson distribution 

f(y | A) = X v e~ x /yl. (27) 

The observation sequence Y n = (yi, yi, ..., ywi)' is subject to some unknown change- 
points. We plot the data y t in Figure 4. Chib (1998) estimates the models with one 
change-point (k = 1) and with two change-points (k = 2), respectively. He assumes the 
parameter A following the prior Gamma(2,1) in the one-change-point case and the prior 
Gamma(3,1) in the other. Hence, given the regime indicators S n , the corresponding 
parameter A i in regime i has the following posteriors with respect to the two priors: 


Posterior 1: 

\ | S n , Y n ~ Gamma(Ai 

| 2 + Uul + Ni), 

(28) 

Posterior 2: 

\i | S n , Y n ~ Gamma(Ai 

| 3 + Ui,l + Ni), 

(29) 
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Time 


Figure 4: Data on coal mining disaster count yt- 




Figure 5: Posterior probability of s* = i. 


where Ui = <5(si,*)z/t aR d iV* = Xit-Li S(st,i). We perform our algorithm with the 

following Gibbs steps: 

Step 1. Sample S n \ X,Y n as in (16) and obtain k, 

Step 2. Sample A * | S n ,Y n as in (28) or (29), 

Step 3. Update a and /3 with the Metropolis-Hastings Sampler as in Section 6.2. 

The above Gibbs sampler is conducted for 5000 sweeps with 1000 burn-in samples. To 
reduce the sampler dependency, the 5000 sweeps are thinned by 50 draws. The sampler 
estimates one change-point in the data. Figure 5 shows the posterior probabilities of the 
regime indicator s t = i at each time point t. The intersections of the two lines s t = 1 
and St = 2 show that the break location exits at around t = 40. Figure 6 provides the 
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(a) Under prior 1. 


(b) Under prior 2. 


Figure 6: Posterior probability mass function of change-point location r,. 


distribution of the transition points r,;. Interestingly, our model produces exactly the 
same figure as the one in Chib (1998). The change-point is identified as occurring at 
around t = 41. 

The corresponding posterior means of the parameters Ai and A2 are 3.1006 and 
0.9387 with posterior standard deviations, 0.2833 and 0.1168, respectively, under the 
prior Gamma(2,1). The posterior means of a and /? are 1.8101 and 0.3697 with standard 
deviations 1.3577 and 0.2464. When using the prior Gamma(3,1), we have the posterior 
means of Ai and A2 equal to3.1308 and 0.9567 with posterior standard deviations 0.2877 
and 0.1218, respectively. The posterior means of a and /3 are 1.8375 and 0.3715 with 
standard deviations 1.3456 and 0.2360, respectively under prior 2. All our results closely 
match those of the literature and we show a certain robustness of our model under 
different prior assumptions. 

In order to check the robustness of the estimation of the number of change-points, k, 
we conduct 1000 replications of the above estimation process and collect 1000 change- 
point numbers. When the first prior is assumed, 77.23% of the 1000 replications detect 
one change-point. We find a similar result for prior 2. Hence, we conclude that without 
assuming the number of change-points a priori , our algorithm detects the same change- 
point number as in the model developed by Chib (1998) with high probability. 

7.2 Real Output 

We also apply our algorithm to estimate structural changes in real Gross Domestic 
Product growth. The data and model are drawn from Maheu and Gordon (2008) (see 
also Geweke and Yu (2011)). Let y t = 100[log(g t /g t _i) — log(p t /p t _i)], where q t is quar¬ 
terly US GDP seasonally adjusted and pt is the GDP price index. The data range from 
the second quarter of 1947 to the third quarter of 2003, for a total of 226 observations 
(see Figure 7). We model the data with a Bayesian AR(2) model with structural change. 
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Figure 7: US real GDP growth from the second quarter of 1947 to the third quarter of 
2003. 


The frequentist autoregressive structural change-model can be found in Chong (2001). 
Suppose the data are subject to k change-points and follow 

Ut = A),s t + Pi,s t yt-i + p2,stVt-2 + £tj ~ N( 0 , <r 2 t ), St = 1 , 2,. .., k + 1 . (30) 

We assume the following hierarchical priors to fti,i,fty and 02,i'- 

Pt = {Po,i,Pi,i,M ~N(/z,n i = l,...,fc + l, ( 31 ) 

where g = (go, gi, g2)' and V = Diag(ug, v\, uf), such that 

p(gj,Vj) oc Inv-Gamma(w| | a, b), j = 0,1, 2. (32) 

We assume the noninformative prior for a'f such that 

Pitrf) oc 1/crf, i = 1,... ,k + 1. (33) 

Conditional on of, the sampling of Pi, g and V is similar to Section 5. For erf, we 
can draw from the following full conditional: 

<?i I ft, S n , Y n - Inv-x 2 ( cr 2 | n - n _i, ——-, i = 1, + 1, (34) 

where ujf = Y, Ti - 1 <t<T t (yt “ ~ Pi,s t yt-l - 2 ) 2 - 

As in the previous applications, we set the inverse-Gamma hyperparameters 
a = b = 1. The M-H update of a and /3 follows the discussion in Section 6.2. The Gibbs 
sampler is conducted for 5000 sweeps with 1000 burn-in samples. The 5000 sweeps are 
thinned by 50 draws. The posterior probabilities of the regime indicator st in Figure 8a 
suggest that the structural break exists between the years 1980 and 1990. Figure 8b 
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(a) Posterior probability of st = i. (b) Posterior probability of break location t*. 


Figure 8: US real GDP growth structural change model. 


Table 7: US real GDP growth structural change model with one change-point. Posterior 
means and posterior standard deviations within parentheses. The results applying Chib’s 
model are drawn from Maheu and Gordon (2008). 



Chib’s 

Model 

DPHMM 


St = 1 

s t =2 

s t = 1 

s t = 2 

00 ,St 

01 , s t 

02 , s t 

0.5642 (0.1228) 
0.2716 (0.0734) 
0.0800 (0.0739) 
1.3331 (0.1542) 

0.4434 (0.1162) 
0.2792 (0.1052) 
0.1588 (0.1010) 
0.3362 (0.0516) 

0.5499 (0.1303) 
0.2812 (0.0837) 
0.0913 (0.0855) 
1.4089 (0.1722) 

0.3894 (0.1169) 
0.2796 (0.1173) 
0.2253 (0.1124) 
0.2672 (0.0460) 


further shows the change-point at the second quarter of 1983, which is close to the re¬ 
sults in Maheu and Gordon (2008). 4 The posterior estimates are summarized in Table 7. 
Finally, the posterior means of a and /3 are 1.7749 and 0.3045 with standard deviations 
1.3422 and 0.1939 respectively. All of our results are consistent with Chib’s estimates. 

Finally, we replicate 1000 times the whole estimation process and check the robust¬ 
ness of the detected change-point number. The result suggests that nearly 100% of the 
replications detect one break point. 


8 Concluding Remarks 

In this paper, we have proposed a new Bayesian multiple change-point model, that is 
the Dirichlet process hidden Markov model. Our model is semiparametric in the sense 
that the number of states is not built-in to the model but endogenously determined. 
As a result, our model avoids the model misspecification problem. We have proposed 
an MCMC sampler which only needs to sample the states around change-points. We 


4 See Figure 4 in Maheu and Gordon (2008). 
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have also proposed the MAP and M-H updates of hyperparameters in the DPHMM 
process. We have presented three specific models, namely, the discrete Poisson model, 
the continuous normal model, and the Ar( 2) model with structural changes. Results 
from the simulations and empirical applications showed that our Dirichlet process hid¬ 
den Markov multiple change-point model detected the true change-point numbers and 
locations with high accuracy. 


Appendix 

In the appendix, we give the derivations of the full conditionals and the Gibbs samplers 
in Section 5. For the case of known variance, we first rewrite the hierarchical model (21) 
as the joint distribution 

fc+l k+1 

p{Y n ,e,p,v 2 I S„,c7 1 ) oc I 6>i,of) I g,v 2 )p(p,v 2 ), (A-l) 

2=1 2=1 

where p(p, v 2 ) corresponds to Inv-Gamma(u 2 | a, b), 9 = {0 \,..., Ok+i)' an d 




Vi =-——- 1 —, and q 2 =-. 

7~i 'Ti —i Ti Ti —1 


From (A-l) and (A-2), we have the following full conditionals: 

fc+i 

p(0i I P,v 2 ,q 2 ,S n ,Y n ) oc N (yi | 0 i ,CT 2 )N(6» i | g,v 2 ) 


(A-2) 


oc n e. 


1 


yi/q 2 +p/v 2 _ 

1 /cr 2 + 1/v 2 ’ 1 /cr 2 + l/v 2 


k+1 


p(p | 0, v 2 , S n , Y n ) oc N(0j | /i, v 2 )p(p , v 2 ) 


2=1 


oc N (fj, I 0,v 2 /(k + 1)), 


fc+i 


p(v 2 I d , /X, 5„, y„) oc (u 2 ) (fe+1)/2 exp -- ^(0* - m) 2 /^ 2 > p{p, v 2 ) 


oc Inv-Gamma \ v 2 I a + 


2=1 
fc + l 


fc+i 




(A-3) 


where 9 = X)j=i $i/(& + !)• Therefore, we can perform the following Gibbs sampler: 

Step 1. Sample S n | 9,p,v,Y n as in (16) and obtain k, 

Step 2. Sample 9,p,v \ S n ,Y n as in (A-3). 


For the case of unknown variance, the full conditional with respect to (22) is 


q 2 | 9, S n , Y n ~ Inv-Gamma ( q 2 


1 n 

c + 7+ + - 1 


(A-4) 


t=l 
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Conditional on a 2 , we apply the same estimation strategy discussed above. The Gibbs 

sampler is thus 

Step 1. Sample S n | 9, p,v, o 2 ,Y n as in (16) and obtain k, 

Step 2. Sample 0,g,v,cr 2 \ S n ,Y n as in (A-3) and (A-4). 
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